factoid question
Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?
Stachura, Damian, Konieczna, Joanna, Nowak, Artur
Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.
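For illustration, a minimal sketch of the embedding-distance snippet retrieval described in this abstract, assuming a sentence-transformers encoder; the model name, the top_k value, and the function name are placeholders, not the authors' actual pipeline.

```python
# Minimal sketch of embedding-distance snippet retrieval (illustrative only;
# the encoder name and top_k value are assumptions, not the paper's settings).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder

def top_snippets(question: str, snippets: list[str], top_k: int = 5) -> list[str]:
    """Return the snippets closest to the question in embedding space."""
    q_emb = model.encode(question, convert_to_tensor=True)
    s_emb = model.encode(snippets, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, s_emb)[0]            # cosine similarity per snippet
    ranked = scores.argsort(descending=True)[:top_k]  # most similar first
    return [snippets[i] for i in ranked]

# Usage: pass the retrieved PubMed snippets and keep only the closest ones
# before building the LLM prompt.
```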
LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge
Galat, Dima, Molla-Aliod, Diego
Biomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.
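A hedged sketch of the ensemble idea for the Yes/No task: each model is queried zero-shot on the same question plus retrieved context, and the final answer is a simple majority vote. The `ask_model` helper and the normalization rule are placeholders, not the paper's actual API calls.

```python
# Sketch of a zero-shot ensemble for Yes/No questions (majority vote).
# `ask_model` is a hypothetical wrapper around each provider's API; the
# normalization rule and tie handling are assumptions, not the paper's setup.
from collections import Counter

def normalize(answer: str) -> str:
    """Map a free-text model response to 'yes' or 'no'."""
    return "yes" if answer.strip().lower().startswith("yes") else "no"

def ensemble_yes_no(question: str, context: str, models: list[str]) -> str:
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer yes or no."
    votes = Counter(normalize(ask_model(m, prompt)) for m in models)
    return votes.most_common(1)[0][0]  # majority answer; ties resolved arbitrarily

def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder for a provider-specific chat-completion call."""
    raise NotImplementedError("wire up the Anthropic / Google client here")
```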
Using Pretrained Large Language Model with Prompt Engineering to Answer Biomedical Questions
Our team participated in the BioASQ 2024 Task 12b and Synergy tasks to build a system that can answer biomedical questions by retrieving relevant articles and snippets from the PubMed database and generating exact and ideal answers. We propose a two-level information retrieval and question-answering system based on pre-trained large language models (LLMs), focused on LLM prompt engineering and response post-processing. We construct prompts with in-context few-shot examples and utilize post-processing techniques like resampling and malformed response detection. We compare the performance of various pre-trained LLM models on this challenge, including Mixtral, OpenAI GPT, and Llama 2. Our best-performing system achieved 0.14 MAP score on document retrieval, 0.05 MAP score on snippet retrieval, 0.96 F1 score for yes/no questions, 0.38 MRR score for factoid questions and 0.50 F1 score for list questions in Task 12b.
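A short sketch of the few-shot prompt construction and response post-processing mentioned above, combining malformed-response detection with resampling; the JSON schema, retry count, example content, and `call_llm` helper are assumptions, not the team's exact implementation.

```python
# Sketch of few-shot prompt construction with malformed-response detection and
# resampling (illustrative; schema, retry count, and `call_llm` are assumptions).
import json

FEW_SHOT = [
    {"question": "Is aspirin an NSAID?", "answer": {"exact_answer": "yes"}},
]

def build_prompt(question: str, snippets: list[str]) -> str:
    examples = "\n".join(
        f"Q: {ex['question']}\nA: {json.dumps(ex['answer'])}" for ex in FEW_SHOT
    )
    context = "\n".join(snippets)
    return (f"{examples}\n\nContext:\n{context}\n\nQ: {question}\n"
            "A (reply with JSON containing an 'exact_answer' field):")

def answer_with_resampling(question: str, snippets: list[str], max_tries: int = 3):
    prompt = build_prompt(question, snippets)
    for _ in range(max_tries):
        raw = call_llm(prompt)           # placeholder for the actual LLM call
        try:
            parsed = json.loads(raw)     # malformed-response detection
            if "exact_answer" in parsed:
                return parsed["exact_answer"]
        except json.JSONDecodeError:
            continue                     # resample on malformed output
    return None                          # give up after max_tries attempts

def call_llm(prompt: str) -> str:
    """Placeholder for an OpenAI / Mixtral / Llama 2 completion call."""
    raise NotImplementedError
```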
ASQA: Factoid Questions Meet Long-Form Answers
Stelmakh, Ivan, Luan, Yi, Dhingra, Bhuwan, Chang, Ming-Wei
An abundance of datasets and availability of reliable evaluation metrics have resulted in strong progress in factoid question answering (QA). This progress, however, does not easily transfer to the task of long-form QA, where the goal is to answer questions that require in-depth explanations. The hurdles include (i) a lack of high-quality data, and (ii) the absence of a well-defined notion of the answer's quality. In this work, we address these problems by (i) releasing a novel dataset and a task that we call ASQA (Answer Summaries for Questions which are Ambiguous); and (ii) proposing a reliable metric for measuring performance on ASQA. Our task focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation. Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary that resolves the ambiguity. In contrast to existing long-form QA tasks (such as ELI5), ASQA admits a clear notion of correctness: a user faced with a good summary should be able to answer different interpretations of the original ambiguous question. We use this notion of correctness to define an automated metric of performance for ASQA. Our analysis demonstrates an agreement between this metric and human judgments, and reveals a considerable gap between human performance and strong baselines.
The combination of context information to enhance simple question answering
With the rapid development of knowledge bases, question answering over knowledge bases has become a hot research topic. In this paper, we focus on answering single-relation factoid questions over a knowledge base. We build a question answering system and study the effect of context information, such as an entity's notable type and out-degree, on fact selection. Experimental results show that context information can improve the results of simple question answering. Question answering (QA) is a classic natural language processing task that aims at building systems that automatically answer questions formulated in natural language [1]. In recent years, several large-scale general-purpose knowledge bases (KBs) have been constructed, including Freebase [2], YAGO [3], DBpedia [4], and Wikidata [5].
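The abstract does not give the scoring details, so the sketch below only illustrates the general idea of combining a base question-fact match score with context features such as notable type and out-degree; all weights, field names, and the base score are assumptions made for this sketch, not the paper's model.

```python
# Illustrative scoring of candidate KB facts using context features
# (entity notable type and out-degree). Weights and field names are assumptions.
import math
from dataclasses import dataclass

@dataclass
class CandidateFact:
    subject: str
    relation: str
    obj: str
    notable_type: str   # e.g. "book.author"
    out_degree: int     # number of outgoing edges for the subject entity
    match_score: float  # base question-fact similarity from an upstream model

def score(fact: CandidateFact, expected_type: str,
          w_type: float = 0.3, w_degree: float = 0.1) -> float:
    type_bonus = w_type if fact.notable_type == expected_type else 0.0
    degree_bonus = w_degree * math.log1p(fact.out_degree)  # favor well-connected entities
    return fact.match_score + type_bonus + degree_bonus

def select_fact(candidates: list[CandidateFact], expected_type: str) -> CandidateFact:
    """Pick the highest-scoring candidate fact as the answer."""
    return max(candidates, key=lambda f: score(f, expected_type))
```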
Artificial intelligence: ARC test focus goes beyond factoid questions
"Common sense" is a phrase everyone hears at one time or another, usually from an angry bystander who think you don't have any. "Humans use common sense to fill in the gaps of any question they are posed, delivering answers within an understood but non-explicit context," Swapna Krishna wrote in Engadget. Add a few years of developmental growth in the young child, and he or she acquires common sense but AI has problems. Calling out the challenge in AI research is Dr. Oren Etzioni, researcher and professor, who leads the Allen Institute for Artificial Intelligence, or AI2, in Seattle, Washington. To get at the fluidity that people have, their natural ability to move from one thing to the next, the programs need what every ten year old has in spades, he said, and that is called common sense---a set of facts, heuristics, observations, all the things that we can bring to the table, but the computer does not.